A Feature Weight Adjustment Algorithm for Document Categorization

Authors

  • Shrikanth Shankar
  • George Karypis
Abstract

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing and in finding information on these huge resources. In this paper we present a fast iterative feature weight adjustment algorithm for the linear-complexity centroid-based classification algorithm. Our algorithm uses a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently. We experimentally evaluate our algorithm on the Reuters-21578 and OHSUMED document collections and compare it against a variety of other categorization algorithms. Our experiments show that feature weight adjustment improves the performance of the centroid-based classifier by 2% to 5%, substantially outperforms Rocchio and Widrow-Hoff, and is competitive with SVM.
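As a rough illustration of the setting the abstract describes, the following is a minimal sketch of a centroid-based classifier with cosine similarity and an optional per-term weight vector. It is not the authors' algorithm: the abstract does not state the discriminating-power measure or the update rule, so the `weights` argument here is a hypothetical input that an adjustment procedure of that kind might produce. All function names and the toy documents are assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse term vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)


def apply_weights(vec, weights):
    """Multiply each term by its weight; terms without a weight keep 1.0."""
    return {t: v * weights.get(t, 1.0) for t, v in vec.items()}


def train_centroids(docs, labels, weights=None):
    """Average the unit-length (weighted) document vectors of each class."""
    weights = weights or {}
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc, label in zip(docs, labels):
        for term, value in normalize(apply_weights(doc, weights)).items():
            sums[label][term] += value
        counts[label] += 1
    return {
        label: normalize({t: v / counts[label] for t, v in terms.items()})
        for label, terms in sums.items()
    }


def classify(doc, centroids, weights=None):
    """Return the class whose centroid has the highest cosine similarity."""
    unit = normalize(apply_weights(doc, weights or {}))

    def cosine(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in unit.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy term-frequency vectors (hypothetical data, not from the paper).
docs = [
    {"wheat": 2, "grain": 1},
    {"wheat": 1, "price": 1},
    {"oil": 2, "price": 1},
    {"oil": 1, "barrel": 2},
]
labels = ["grain", "grain", "crude", "crude"]

centroids = train_centroids(docs, labels)
prediction = classify({"oil": 1, "price": 1}, centroids)

# Hypothetical adjusted weights: down-weight a term shared by both classes.
weights = {"price": 0.1}
weighted_centroids = train_centroids(docs, labels, weights)
weighted_prediction = classify({"oil": 1, "price": 1}, weighted_centroids, weights)
```

Training makes one pass over the documents and classification compares a document against one centroid per class, which is where the linear complexity mentioned in the abstract comes from. Down-weighting a term such as `price`, which occurs in both classes, shows the kind of effect a weight adjustment could have; how the paper actually chooses the weights is not specified here.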

Similar resources

Weight Adjustment Schemes for a Centroid Based Classifier

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...

Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

Text categorization is the task of deciding whether a document belongs to a set of prespecified classes of documents. Automatic classification schemes can greatly facilitate the process of categorization. Categorization of documents is challenging, as the number of discriminating words can be very large. Many existing algorithms simply would not work with such a large number of features. k-neares...

Cluster Based Hybrid Niche Mimetic and Genetic Algorithm for Text Document Categorization

An efficient cluster based hybrid niche mimetic and genetic algorithm for text document categorization to improve the retrieval rate of relevant document fetching is addressed. The proposal minimizes the processing of structuring the document with better feature selection using hybrid algorithm. In addition restructuring of feature words to associated documents gets reduced, in turn increases d...

Class-Based Weighted NB for Text Categorization

Naïve Bayes classifier is a supervised and probabilistic learning method (Manning, Raghavan, & Schuetze, 2008) which greatly simplifies learning by making the assumption that provided features are conditionally independent. Although this assumption usually does not hold, this classifier proves to compete well with other more sophisticated techniques (Rish, 2001). Moreover, being fast and easy t...

Publication date: 2000